Abstract

We propose a novel actor-critic algorithm with guaranteed convergence to an optimal policy for a
discounted reward Markov decision process. The actor incorporates a descent direction that is motivated
by the solution of a certain non-linear optimization problem. We also discuss an extension to incorporate
function approximation and demonstrate the practicality of our algorithms on a network routing
application.


1 Introduction 


We consider a discounted MDP with state space $S$ and action space $A$, both assumed to be finite. A randomized
policy $\pi$ specifies how actions are chosen, i.e., $\pi(s)$, for any $s \in S$, is a distribution over the actions $A$. The
objective is to find the optimal policy $\pi^*$ that is defined as follows:

$$\pi^*(s) = \arg\max_{\pi \in \Pi} \left\{ v^{\pi}(s) := E\left[ \sum_{n} \beta^n \sum_{a \in A(s_n)} r(s_n, a)\,\pi(s_n, a) \,\Big|\, s_0 = s \right] \right\}, \tag{1}$$

where $r(s, a)$ is the instantaneous reward obtained in state $s$ upon choosing action $a$, $\beta \in (0,1)$ is the
discount factor and $\Pi$ is the set of all admissible policies. We shall use $v^* (= v^{\pi^*})$ to denote the optimal
value function.

Actor-critic algorithms (cf. [?], [?] and [9]) are popular stochastic approximation variants of the
well-known policy iteration procedure for solving (1). The critic recursion provides estimates of the value
function using the well-known temporal-difference (TD) algorithm, while the actor recursion performs a gradient
search over the policy space. We propose an actor-critic algorithm with a novel descent direction for the
actor recursion. The novelty of our approach is that we can motivate the actor recursion in the following
manner: the descent direction for the actor update is such that it (globally) minimizes the objective of a
non-linear optimization problem, whose minima coincide with the optimal policy $\pi^*$. This descent direction
is similar to that used in Algorithm 2 in [?], except that we use a different exponent for the policy, and a
similar interpretation can be used to explain Algorithm 2 (and also 5) of [?]. Using multi-timescale stochastic
approximation, we provide global convergence guarantees for our algorithm.

While the proposed algorithm is for the case of full state representations, we also briefly discuss a
function approximation variant of the same. Further, we conduct numerical experiments on a shortest-path
network problem. From the results, we observe that our actor-critic algorithm performs on par with the
well-known Q-learning algorithm on a smaller-sized network, while on a larger-sized network, the function
approximation variant of our algorithm does better than the algorithm in [1].


2 The Non-Linear Optimization Problem 

With an objective of finding the optimal value and policy tuple, we formulate the following problem:

$$\min_{\pi \in \Pi} \min_{v} \; J(v, \pi) = \sum_{s \in S} \Big[ v(s) - \sum_{a \in A} \pi(s,a) Q(s,a) \Big]$$

$$\text{s.t. } \forall s \in S,\ a \in A: \quad (a)\ \pi(s,a) \ge 0, \quad (b)\ \sum_{a \in A} \pi(s,a) = 1, \quad \text{and} \quad (c)\ g(s,a) \le 0. \tag{2}$$

In the above, $g(s,a) := Q(s,a) - v(s)$, with $Q(s,a) := r(s,a) + \beta \sum_{s'} p(s'|s,a) v(s')$. Here $p(s'|s,a)$
denotes the probability of a transition from state $s$ to $s'$ upon choosing action $a$.

The objective in (2) is to ensure that there is no Bellman error, i.e., the value estimates $v$ are correct
for the policy $\pi$. The constraints (2(a))-(2(b)) ensure that $\pi$ is a distribution, while the constraint (2(c))
is a proxy for the max in (1). Notice that the non-linear problem (2) has a quadratic objective and linear
constraints.

From the definition of $\pi^*$, it is easy to infer the following claim:


Theorem 1. Let $g^*(s,a) := Q^*(s,a) - v^*(s)$, with $Q^*(s,a) := r(s,a) + \beta \sum_{s'} p(s'|s,a) v^*(s')$, $\forall s \in S, a \in A$. Then,

(i) Any feasible $(v^*, \pi^*)$ is optimal in the sense of (2) if and only if $J(v^*, \pi^*) = 0$.

(ii) $\pi^*$ is an optimal policy if and only if $\pi^*(s,a) g^*(s,a) = 0$, $\forall a \in A, s \in S$.
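The two conditions of Theorem 1 can be checked numerically on a small example. The following is a minimal sketch, not the authors' code: the two-state, two-action MDP (`P`, `R`) is a hypothetical example, the discount factor 0.8 is borrowed from the experiments section, and value iteration is used only as a convenient way to obtain $v^*$.

```python
BETA = 0.8  # discount factor (value borrowed from the experiments section)

# Hypothetical MDP: P[s][a][t] = p(t | s, a) and R[s][a] = r(s, a).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
STATES, ACTIONS = range(2), range(2)


def q_values(v):
    """Q(s, a) = r(s, a) + beta * sum_t p(t | s, a) v(t)."""
    return [[R[s][a] + BETA * sum(P[s][a][t] * v[t] for t in STATES)
             for a in ACTIONS] for s in STATES]


def objective(v, pi):
    """J(v, pi) = sum_s [v(s) - sum_a pi(s, a) Q(s, a)], the objective of (2)."""
    q = q_values(v)
    return sum(v[s] - sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES)


# Value iteration: repeat v(s) <- max_a Q(s, a) until numerically converged.
v_star = [0.0, 0.0]
for _ in range(500):
    v_star = [max(row) for row in q_values(v_star)]

q_star = q_values(v_star)
g_star = [[q_star[s][a] - v_star[s] for a in ACTIONS] for s in STATES]

# Greedy deterministic policy with respect to Q*.
pi_star = []
for s in STATES:
    best = max(ACTIONS, key=lambda a: q_star[s][a])
    pi_star.append([1.0 if a == best else 0.0 for a in ACTIONS])

print("v* =", v_star)
print("J(v*, pi*) =", objective(v_star, pi_star))
print("g* =", g_star)
```

On this example the printed objective is zero up to rounding, every entry of `g_star` is non-positive, and the products $\pi^*(s,a) g^*(s,a)$ vanish, which is what parts (i) and (ii) state.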


3 Descent direction


Proposition 1. For the objective in (2), the direction $\sqrt{\pi(s,a)}\,g(s,a)$ is a non-ascent and, in particular, a
descent direction along $\pi(s,a)$ if $\sqrt{\pi(s,a)}\,g(s,a) \neq 0$, for all $s \in S, a \in A$.

Proof. Consider any action $a \in A$ for some $s \in S$. We show that $\sqrt{\pi(s,a)}\,g(s,a)$ is a descent direction by
the following Taylor series argument. Let

$$\hat\pi(s,a) = \pi(s,a) + \delta \sqrt{\pi(s,a)}\,g(s,a),$$

for a small $\delta > 0$. We define $\hat\pi$ to be the same as $\pi$ except with the probability of picking action $a$ in state
$s \in S$ being changed to $\hat\pi(s,a)$ (and the rest staying the same). Then, by Taylor's expansion of $J(v,\cdot)$ up to
the first-order term, we have that

$$J(v,\hat\pi) = J(v,\pi) + \delta \sqrt{\pi(s,a)}\,g(s,a)\,\frac{\partial J(v,\pi)}{\partial \pi(s,a)}.$$

Note that higher-order terms are all zero since $J(v,\pi)$ is linear in $\pi$. It should be easy to see from the definition
of the objective that $\frac{\partial J(v,\pi)}{\partial \pi(s,a)} = -g(s,a)$. So,

$$J(v,\hat\pi) = J(v,\pi) - \delta \sqrt{\pi(s,a)}\,(g(s,a))^2.$$

Thus, for $a \in A$ and $s \in S$ where $\pi(s,a) > 0$ and $g(s,a) \neq 0$, $J(v,\hat\pi) < J(v,\pi)$, while when
$\sqrt{\pi(s,a)}\,g(s,a) = 0$, $J(v,\hat\pi) = J(v,\pi)$. □
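The first-order argument in the proof can be verified numerically. The sketch below is illustrative only: the MDP, the value estimates `v` and the policy `pi` are hypothetical, and the objective is written in the form $J(v,\pi) = -\sum_{s,a} \pi(s,a) g(s,a)$, which coincides with the objective of the optimization problem whenever each $\pi(s,\cdot)$ sums to one and is the form for which the partial derivative equals $-g(s,a)$.

```python
import math

BETA = 0.8
# Hypothetical MDP: P[s][a][t] = p(t | s, a) and R[s][a] = r(s, a).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]


def g_values(v):
    """g(s, a) = r(s, a) + beta * sum_t p(t | s, a) v(t) - v(s)."""
    return [[R[s][a] + BETA * sum(P[s][a][t] * v[t] for t in range(2)) - v[s]
             for a in range(2)] for s in range(2)]


def objective(v, pi):
    """J(v, pi) = -sum_{s,a} pi(s, a) g(s, a) (assumed equivalent form)."""
    g = g_values(v)
    return -sum(pi[s][a] * g[s][a] for s in range(2) for a in range(2))


v = [1.0, 3.0]                    # arbitrary value estimates
pi = [[0.5, 0.5], [0.3, 0.7]]     # arbitrary randomized policy
delta, s, a = 0.01, 0, 1

g = g_values(v)
pi_hat = [row[:] for row in pi]
pi_hat[s][a] += delta * math.sqrt(pi[s][a]) * g[s][a]   # perturb one entry only

change = objective(v, pi_hat) - objective(v, pi)
predicted = -delta * math.sqrt(pi[s][a]) * g[s][a] ** 2
print(change, predicted)
```

Because $J$ is linear in $\pi$, the observed change matches the predicted value $-\delta\sqrt{\pi(s,a)}\,(g(s,a))^2$ up to rounding, and it is strictly negative here since $g(0,1) \neq 0$.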

The next section utilizes the descent direction to derive an actor-critic algorithm. 


4 The Actor-Critic Algorithm 

Combining the descent procedure in $\pi$ from the previous section with a TD(0) [11] type update for the
value function $v$ on a faster time-scale, we have the following update scheme:

$$\begin{aligned}
\text{Q-Value: } & Q_n(s,a) = r(s,a) + \beta v_n(s'), & \text{TD Error: } & g_n(s,a) = Q_n(s,a) - v_n(s), \\
\text{Critic: } & v_{n+1}(s) = v_n(s) + c(n) g_n(s,a), & \text{Actor: } & \pi_{n+1}(s,a) = \Gamma\big(\pi_n(s,a) + b(n)\sqrt{\pi_n(s,a)}\,g_n(s,a)\big).
\end{aligned} \tag{3}$$

In the above, $\Gamma$ is a projection operator that ensures that the updates to $\pi$ stay within the simplex
$\mathcal{D} = \{(x_1,\ldots,x_q) \mid x_i \ge 0, \forall i = 1,\ldots,q, \ \sum_{j=1}^{q} x_j \le 1\}$, where $q = |A|$. Further, the step-sizes $b(n)$ and $c(n)$
satisfy

$$\sum_{n=1}^{\infty} c(n) = \sum_{n=1}^{\infty} b(n) = \infty, \quad \sum_{n=1}^{\infty} \big(c^2(n) + b^2(n)\big) < \infty \quad \text{and} \quad b(n) = o(c(n)).$$
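A minimal Python sketch of one pass through the update scheme for a single observed transition is given below, under stated assumptions. The schedules $c(n) = n^{-0.6}$ and $b(n) = 1/n$ are illustrative choices that satisfy the step-size conditions (they are not taken from the paper), the projection is implemented as the Euclidean projection onto $\{x : x_i \ge 0, \sum_i x_i \le 1\}$, and all function names are hypothetical.

```python
import math

BETA = 0.8


def c_step(n):
    """Critic step size (faster time-scale); illustrative choice."""
    return 1.0 / n ** 0.6


def b_step(n):
    """Actor step size (slower time-scale); b(n) / c(n) -> 0."""
    return 1.0 / n


def project(x):
    """Euclidean projection onto {x : x_i >= 0, sum_i x_i <= 1}."""
    clipped = [max(xi, 0.0) for xi in x]
    if sum(clipped) <= 1.0:
        return clipped
    # Otherwise the projection lies on the face sum_i x_i = 1 (sort-based method).
    u = sorted(x, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(xi - theta, 0.0) for xi in x]


def ac_opt_step(v, pi, s, a, reward, s_next, n):
    """Apply the TD error, critic and actor updates in place; return the TD error."""
    g = reward + BETA * v[s_next] - v[s]              # Q_n(s, a) - v_n(s)
    row = list(pi[s])
    row[a] += b_step(n) * math.sqrt(pi[s][a]) * g     # actor increment
    pi[s] = project(row)
    v[s] += c_step(n) * g                             # critic increment
    return g


v = [0.0, 0.0]
pi = [[0.5, 0.5], [0.5, 0.5]]
td_error = ac_opt_step(v, pi, s=0, a=1, reward=-2.0, s_next=1, n=1)
print(td_error, v, pi)
```

In this toy call a negative TD error lowers the probability of the chosen action; the projection then clips the entry at zero, which leaves the row with total mass below one, as the set $\mathcal{D}$ permits.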


Remark 1. (Connection to Algorithm 2 of [?]) From Proposition 1, we have that $\sqrt{\pi(s,a)}\,g(s,a)$ is a
descent direction for $\pi(s,a)$. This implies that $(\pi(s,a))^{\alpha} \times \sqrt{\pi(s,a)}\,g(s,a)$, for any $\alpha \ge 0$, is also a descent
direction. Hence, a generic update rule for $\pi$ is:

$$\pi_{n+1}(s,a) = \Gamma\big(\pi_n(s,a) + b(n)(\pi_n(s,a))^{\alpha'} g_n(s,a)\big) \quad \text{for any } \alpha' \ge \tfrac{1}{2}.$$

The special case of $\alpha' = 1$ coincides with the $\pi$-recursion in Algorithm 2 of [?].


5 Convergence Analysis 

For the purpose of analysis, we assume that the underlying Markov chain for any policy $\pi \in \Pi$ is irreducible.

Main result. Let $v^{\pi} = [I - \beta P_{\pi}]^{-1} R_{\pi}$, where $R_{\pi} = (r(s,\pi), s \in S)^T$ is the column vector of rewards
and $P_{\pi} = [p(y|s,\pi), s \in S, y \in S]$ is the transition probability matrix, both for a given $\pi$. Consider the
ODE:

$$\frac{d\pi(s,a)}{dt} = \bar\Gamma\big(\sqrt{\pi(s,a)}\,g^{\pi}(s,a)\big), \quad \forall a \in A, s \in S, \text{ where} \tag{4}$$

$$g^{\pi}(s,a) := r(s,a) + \beta \sum_{y \in U(s)} p(y|s,a) v^{\pi}(y) - v^{\pi}(s). \tag{5}$$

In the above, $\bar\Gamma$ is a projection operator defined by
$\bar\Gamma(\epsilon(\pi)) := \lim_{\alpha \downarrow 0} \frac{\Gamma(\pi + \alpha\epsilon(\pi)) - \pi}{\alpha}$, for any continuous $\epsilon(\cdot)$.


Theorem 2. Let $K$ denote the set of all equilibria of the ODE (4), $G$ the set of all feasible points of the
problem (2) and $\hat K := K \cap G$. Then, the iterates $(v_n, \pi_n), n \ge 0$ governed by (3) satisfy

$$(v_n, \pi_n) \to K^* \text{ a.s. as } n \to \infty, \text{ where } K^* = \{(v^*, \pi^*) \mid \pi^* \in \hat K\}.$$


The algorithm (3) comprises updates to $v$ on the faster time-scale and to $\pi$ on the slower time-scale.
Using the theory of two time-scale stochastic approximation [?, Chapter 6], we sketch the convergence
of these recursions as well as prove global optimality in the following steps (the reader is referred to the
appendix for proof details):


Step 1: Critic Convergence. We assume $\pi$ to be time-invariant owing to time-scale separation. Consider
the ODE:

$$\frac{dv(s)}{dt} = r(s,\pi) + \beta \sum_{s' \in S} p(s'|s,\pi) v(s') - v(s), \quad \forall s \in S, \tag{6}$$

where $r(s,\pi) = \sum_{a \in A} \pi(s,a) r(s,a)$ and $p(s'|s,\pi) = \sum_{a \in A} \pi(s,a) p(s'|s,a)$. It is well-known (cf. [?])
that the above ODE has a unique globally asymptotically stable equilibrium $v^{\pi}$. We now have the main
result regarding the convergence of $v_n$ on the faster time-scale.

Theorem 3. For a given $\pi$, the critic recursion in (3) satisfies $v_n \to v^{\pi}$ a.s. as $n \to \infty$.
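The stability of the critic ODE (6) for a fixed policy can be illustrated by integrating it numerically. The following sketch uses a forward-Euler discretization on a hypothetical two-state MDP with a hypothetical fixed policy; it only illustrates the limiting ODE and is not the stochastic critic recursion itself.

```python
BETA = 0.8
# Hypothetical MDP: P[s][a][t] = p(t | s, a) and R[s][a] = r(s, a).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
STATES, ACTIONS = range(2), range(2)
pi = [[0.5, 0.5], [0.3, 0.7]]      # fixed randomized policy (hypothetical)

# Policy-averaged reward r(s, pi) and transition kernel p(t | s, pi).
r_pi = [sum(pi[s][a] * R[s][a] for a in ACTIONS) for s in STATES]
p_pi = [[sum(pi[s][a] * P[s][a][t] for a in ACTIONS) for t in STATES]
        for s in STATES]


def ode_rhs(v):
    """Right-hand side of (6): r(s, pi) + beta * sum_t p(t | s, pi) v(t) - v(s)."""
    return [r_pi[s] + BETA * sum(p_pi[s][t] * v[t] for t in STATES) - v[s]
            for s in STATES]


def integrate(v0, h=0.1, steps=2000):
    """Forward-Euler integration of (6) starting from v0."""
    v = list(v0)
    for _ in range(steps):
        dv = ode_rhs(v)
        v = [v[s] + h * dv[s] for s in STATES]
    return v


v_a = integrate([10.0, -10.0])
v_b = integrate([0.0, 0.0])
print(v_a, v_b, ode_rhs(v_a))
```

Both initial conditions end at the same point, where the right-hand side of (6) vanishes; this point is $v^{\pi}$.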


Step 2: Actor Convergence. Due to timescale separation, we can assume that the critic has converged in
the analysis of the actor recursion. We first provide a useful characterization for the set $K$ of equilibria of
the ODE (4).

Lemma 4. Let $L = \{\pi \mid \pi(s) \text{ is a probability vector over } A, \forall s \in S\}$ denote the set of policies that are
distributions over the actions for each state. Then,

$$\pi \in K \text{ if and only if } \pi \in L \text{ and } \sqrt{\pi(s,a)}\,g^{\pi}(s,a) = 0, \ \forall a \in A, s \in S.$$


From Lemma 4, the set $K$ can be redefined as follows: $K = \{\pi \in L \mid \sqrt{\pi(s,a)}\,g^{\pi}(s,a) = 0, \forall a \in A, s \in S\}$.
The set $K$ can be partitioned using the feasible set $G$ of (2) as $K = \hat K \cup \hat K^c$, where $\hat K = K \cap G$.

Lemma 5. All $\pi^* \in \hat K^c$ are unstable equilibrium points of the system of ODEs (4).

Proof. For any $\pi^* \in \hat K^c$, there exists some $a \in A(s), s \in S$, such that $g^{\pi^*}(s,a) > 0$ and $\pi^*(s,a) = 0$
because $\hat K^c$ is not in the feasible set $G$. Let $B_\delta(\pi^*) = \{\pi \in L \mid \|\pi - \pi^*\| < \delta\}$. Choose $\delta > 0$ such that
$g^{\pi}(s,a) > 0$ for all $\pi \in B_\delta(\pi^*) \setminus K$. So, $\bar\Gamma\big(\sqrt{\pi(s,a)}\,g^{\pi}(s,a)\big) > 0$ for any $\pi \in B_\delta(\pi^*) \setminus K$, which suggests
that $\pi(s,a)$ will be increasingly moving away from $\pi^*$. Thus, $\pi^*$ is an unstable equilibrium point for the
system of ODEs (4). □

Remark 2. ($G = \hat K$) We already have that $\hat K \subseteq G$. So, it is sufficient to show that $G \subseteq \hat K$. A policy
$\pi$ belongs to $G$ if $g^{\pi}(s,a) \le 0$ for all $a \in A(s)$ and $s \in S$. By definition, $v^{\pi}$ is obtained from
$\sum_{a \in A(s)} \pi(s,a) g^{\pi}(s,a) = 0, \forall s \in S$. Since each term in the summation is non-positive, we have that
$\pi(s,a) g^{\pi}(s,a) = 0 = \sqrt{\pi(s,a)}\,g^{\pi}(s,a), \forall a \in A(s), s \in S$ and hence $G = \hat K$.

Proof of Theorem 2

Proof. The update of $\pi$ on the slower time-scale can be re-written as

$$\pi_{n+1}(s,a) = \Gamma\big(\pi_n(s,a) + b(n)(H(\pi_n) + \eta_n)\big), \text{ where} \tag{7}$$

$H(\pi_n) = \sqrt{\pi_n(s,a)}\,g^{\pi_n}(s,a)$ and $\eta_n = \sqrt{\pi_n(s,a)}\,g_n(s,a) - H(\pi_n)$. We can infer the claim regarding
convergence of $\pi_n$ governed by (7) using the Kushner-Clark lemma (Theorem 2.3.1 in [10]), if we verify the
following:

(i) $H$ is a continuous function. (ii) The sequence $\eta_n, n \ge 0$ is a bounded random sequence with $\eta_n \to 0$
almost surely as $n \to \infty$. (iii) The step-sizes $b(n), n \ge 0$ satisfy $b(n) \to 0$ as $n \to \infty$ and $\sum_n b(n) = \infty$.

Now, (i) follows by definition of $H$ and (iii) by assumption on step-sizes. Consider (ii): $\eta_n$ is bounded
since we consider a finite state-action space setting ($\Rightarrow g(s,a)$ is bounded) and $\pi$ is trivially upper-bounded.
From Theorem 3, $v_n \to v^{\pi}$ a.s. as $n \to \infty$ and hence, $\eta_n \to 0$ a.s. The claim follows. □


Remark 3. (Avoidance of traps) Note that from the foregoing, the set $K$ comprises both stable and
unstable attractors and, in principle from Lemma 5, the iterates $\pi_n$ governed by (3) can converge to an
unstable equilibrium. A standard trick to avoid such traps, as discussed in Chapter 4 of [?], is to introduce
additional noise in the iterates. For this purpose, we perturb the policy every $r > 0$ iterations to obtain a
new policy $\hat\pi$ as follows:

$$\hat\pi(s,a) = \frac{\pi(s,a) + \eta}{\sum_{a' \in A} (\pi(s,a') + \eta)}. \tag{8}$$

The above scheme ensures that the convergence of the policy sequence $\pi_n$ governed by (3) is to the stable
set $\hat K$.
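The perturbation of Remark 3 can be sketched in a few lines. This is an illustration under assumptions: the perturbed policy is taken to be the current row shifted by a small positive constant and renormalized, and the constant name `eta` and its default value are hypothetical. The motivation is visible from the actor update itself: its increment is proportional to $\sqrt{\pi(s,a)}$, so an action whose probability is exactly zero can never regain mass without such a perturbation.

```python
def perturb(pi_row, eta=0.01):
    """Return (pi(s, a) + eta) / sum_b (pi(s, b) + eta) for one state s."""
    total = sum(p + eta for p in pi_row)
    return [(p + eta) / total for p in pi_row]


row = [1.0, 0.0, 0.0]          # a deterministic row: two actions have zero mass
new_row = perturb(row)
print(new_row)
```

After the perturbation every action has strictly positive probability while the ordering of the probabilities is preserved.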


Step 3: Global Optimality. Here we establish that our algorithm converges to a globally optimal policy.

Lemma 6. If $\pi \in \hat K$, then $\pi$ is globally optimal and the corresponding value function $v^{\pi}$ is the same as the
optimal value $v^*$.

Proof.

If $\pi(s,a) > 0$, then $g(s,a) = 0 \Rightarrow v^{\pi}(s) = r(s,a) + \beta \sum_{y \in U(s)} p(y|s,a) v^{\pi}(y)$.

If $\pi(s,a) = 0$, then $g(s,a) \le 0 \Rightarrow v^{\pi}(s) \ge r(s,a) + \beta \sum_{y \in U(s)} p(y|s,a) v^{\pi}(y)$.

Thus, it follows that, $\forall s \in S$,

$$v^{\pi}(s) = \max_{a \in A(s)} \Big[ r(s,a) + \beta \sum_{y \in U(s)} p(y|s,a) v^{\pi}(y) \Big].$$

This is the Bellman optimality equation, whose unique solution is $v^*$; hence $v^{\pi} = v^*$. □


6 Extension to incorporate function approximation 

The actor-critic algorithm described in Section 4 is infeasible for implementation in high-dimensional
settings where the state and action spaces are large. A standard approach to alleviate this problem is to employ
function approximation techniques and parameterize the value function and policies as follows:

Value function. Using a linear architecture, the value function is approximated as $v^{\pi}(s) \approx f(s)^T w$, for
any given policy $\pi$. Here $f(s)$ is the state feature vector and $w$ is the value function parameter, both in some
low-dimensional subspace $R^{d_1}$, with $d_1 \ll |S|$.


Policies. We consider a parameterized class of policies such that each policy is continuously differentiable
in its parameter. A common approach is to employ the Boltzmann distribution to obtain the following form
for policies:

$$\pi^{\theta}(s,a) = \frac{e^{\theta^T \phi(s,a)}}{\sum_{b \in A} e^{\theta^T \phi(s,b)}}.$$

Here $\phi(s,a)$ is a state-action feature vector and $\theta$ is the policy
parameter vector, both assumed to be in a compact subset $C \subset R^{d_2}$.
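The Boltzmann parameterization can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the two-action feature vectors are hypothetical, and the subtraction of the largest score is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math


def boltzmann_policy(theta, phis):
    """Return pi(s, .) with pi(s, a) proportional to exp(theta^T phi(s, a)).

    phis[a] is the feature vector phi(s, a) of action a in the current state.
    """
    scores = [sum(t * x for t, x in zip(theta, phi)) for phi in phis]
    top = max(scores)                       # subtract the max for numerical stability
    weights = [math.exp(sc - top) for sc in scores]
    total = sum(weights)
    return [w / total for w in weights]


phis = [[1.0, 0.0], [0.0, 1.0]]             # hypothetical features for two actions
uniform = boltzmann_policy([0.0, 0.0], phis)
tilted = boltzmann_policy([1.0, 0.0], phis)
print(uniform, tilted)
```

With a zero parameter vector the policy is uniform, which matches the uniform initial policy used in the experiments; increasing a component of the parameter shifts mass towards actions whose features load on that component.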


Update rule. Choose $a_n \sim \pi^{\theta_n}(s_n, \cdot)$ and observe the reward $r(s_n, a_n)$. Then, update the critic parameter
$w_n$ and policy parameter $\theta_n$ as follows:

$$\text{TD Error: } g_n(s_n,a_n) := r(s_n,a_n) + \beta f(s_{n+1})^T w_n - f(s_n)^T w_n, \tag{9}$$

$$\text{Critic: } w_{n+1} = w_n + c(n) g_n(s_n,a_n) f(s_n), \tag{10}$$

$$\text{Actor: } \theta_{n+1} = \Gamma\big(\theta_n + b(n)\,\pi_n(s_n,a_n)^{1/2}\,\psi_n(s_n,a_n)\,g_n(s_n,a_n)\big). \tag{11}$$

In the above, $\Gamma$ projects any $\theta$ onto a compact set $C \subset R^{d_2}$ and $\psi_n(s_n,a_n) = \nabla_{\theta} \log \pi_n(s_n,a_n)$ are the
compatible features. For Boltzmann policies, $\psi_n(s_n,a_n) = \phi(s_n,a_n) - \sum_{b \in A} \pi_n(s_n,b)\,\phi(s_n,b)$.

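The critic recursion above follows from the standard TD(0) with function approximation update. The
idea is to have the increment $\Delta w_n$ proportional to the negative gradient, with respect to $w_n$, of the squared
error $[\hat v(s_n) - f(s_n)^T w_n]^2$, where $\hat v(s_n) = r(s_n,a_n) + \beta f(s_{n+1})^T w_n$ is the current estimate of the return
and is held fixed when differentiating. A natural update increment for the actor recursion is to have

$$\frac{\partial J}{\partial \theta_n} = \frac{\partial J}{\partial \pi_n}\,\frac{\partial \pi_n}{\partial \theta_n} \propto \sqrt{\pi_n(s_n,a_n)}\,g_n(s_n,a_n)\,\pi_n(s_n,a_n)\,\psi_n(s_n,a_n).$$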
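One pass of the function approximation updates can be sketched as below, under stated assumptions. The actor increment uses the factor $\sqrt{\pi_n(s_n,a_n)}\,\psi_n(s_n,a_n)\,g_n(s_n,a_n)$ on the sampled action, so that its expectation over $a_n \sim \pi_n$ is proportional to the increment described above. The compact set $C$ is taken to be a box `[-THETA_BOUND, THETA_BOUND]^d` with projection by clipping, and the step sizes are passed in by the caller. These choices, together with all names and feature vectors, are hypothetical and not taken from the paper.

```python
import math

BETA = 0.8
THETA_BOUND = 10.0   # hypothetical box standing in for the compact set C


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def boltzmann_policy(theta, phis):
    """pi(s, a) proportional to exp(theta^T phi(s, a)); phis[a] = phi(s, a)."""
    scores = [dot(theta, phi) for phi in phis]
    top = max(scores)
    weights = [math.exp(sc - top) for sc in scores]
    total = sum(weights)
    return [w / total for w in weights]


def fa_step(w, theta, f_s, f_next, phis, a, reward, c_n, b_n):
    """Return (new_w, new_theta, td_error) after one TD-error/critic/actor pass."""
    pi = boltzmann_policy(theta, phis)
    g = reward + BETA * dot(f_next, w) - dot(f_s, w)            # TD error
    # Compatible features: psi = phi(s, a) - sum_b pi(s, b) phi(s, b).
    mean_phi = [sum(pi[b] * phis[b][k] for b in range(len(phis)))
                for k in range(len(theta))]
    psi = [phis[a][k] - mean_phi[k] for k in range(len(theta))]
    new_w = [wk + c_n * g * fk for wk, fk in zip(w, f_s)]       # critic
    scale = b_n * math.sqrt(pi[a]) * g
    new_theta = [min(max(tk + scale * pk, -THETA_BOUND), THETA_BOUND)
                 for tk, pk in zip(theta, psi)]                 # actor + projection
    return new_w, new_theta, g


new_w, new_theta, g = fa_step(
    w=[0.0, 0.0], theta=[0.0, 0.0],
    f_s=[1.0, 0.0], f_next=[0.0, 1.0],
    phis=[[1.0, 0.0], [0.0, 1.0]], a=0, reward=1.0, c_n=0.5, b_n=0.1)
print(new_w, new_theta, g)
```

In this toy call the TD error is positive, so the critic raises the value of the visited state and the actor moves the parameter in the direction that makes the sampled action more likely.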


Preliminary result:

In addition to irreducibility of the underlying Markov chain for any policy and differentiability of the policy,
we assume that the feature matrix $\Phi$ with rows $f(s)^T, \forall s \in S$ is full rank. These assumptions are standard
in the analysis of actor-critic algorithms (cf. [?]). Let $d^{\pi^{\theta}}(s) = (1 - \beta) \sum_n \beta^n \Pr(s_n = s \mid s_0; \pi^{\theta})$ for any
policy $\theta \in C$. Let $K$ denote the set of all equilibria of the ODE:

$$\dot\theta(t) = \bar\Gamma\Big( \sum_{s \in S} d^{\pi^{\theta(t)}}(s) \sum_{a \in A} \sqrt{\pi^{\theta(t)}(s,a)}\,\nabla \pi^{\theta(t)}(s,a) \Big( r(s,a) + \beta \sum_{s' \in S} p(s'|s,a)\, w^{\theta(t)T} f(s') - w^{\theta(t)T} f(s) \Big) \Big). \tag{12}$$


Theorem 7. The iterates $(w_n, \theta_n), n \ge 0$ governed by (11) satisfy

$$(w_n, \theta_n) \to K^* \text{ a.s. as } n \to \infty, \text{ where } K^* = \{(w^{\theta}, \theta) \mid \theta \in K\}.$$


In the above, $w^{\theta}$ is the solution to $A w^{\theta} = b$, where $A = \Phi^T \Psi_{\theta} (I - \beta P) \Phi$ and $b = \Phi^T \Psi_{\theta} r$, with $\Psi_{\theta}$ a
diagonal matrix with the stationary distribution of the Markov chain underlying policy with parameter $\theta$ as
the diagonal entries and $r$ a column vector with entries $\sum_{a} \pi^{\theta}(s,a) r(s,a)$, for each $s \in S$.

Figure 1: Network graphs with associated rewards


(a) AC-OPT algorithm

Node  Value function  MPA¹  Probability
1     -17.83          2     0.87
2     -19.64          2     0.96
3     -9.24           1     0.95
4     -6.00           1     0.96
5     -8.22           1     0.92

(b) Q-learning algorithm

Node  Q(s,1)   Q(s,2)   Q(s,3)    Q(s,4)
1     -24.4    -15.72   -20.376   N.A
2     -25.72   -16.72   -19.576   N.A
3     -8.4     -15.8    -23.376   -21.576
4     -6       -17.72   -32.376   N.A
5     -8       -8.72    -30.576   N.A

Figure 2: Performance of Q-learning and actor-critic algorithms on six node network graph


7 Simulation Experiments 

Setup. Routing packets through a communication network is a natural application for reinforcement learning
algorithms. Q-routing, that is, using Q-learning for routing packets in dynamically changing networks,
has been investigated among others by [?] and [3]. We have considered a highly simplified version of the
problem over two network graph settings:

Six node graph. As shown in Fig. 1a, the state space here consists of the nodes themselves, that is, $S =
\{1, 2, 3, 4, 5, 6\}$, and the number of actions in a state corresponds to the number of neighbouring nodes
to which a packet can be routed from the given node. The next state is chosen randomly and node 6
is the absorbing destination node. Further, each run started from state 1 and the initial estimate of the
Q-value was 0 for all states. Rewards in each transition are the negative of the edge weight (as depicted
in Fig. 1a).

44 node graph. As shown in Fig. 1b, the state space here is $S = \{0, 1, 2, \ldots, 43, 44\}$, with 44 being
the destination node. The actions are as follows: at any node, start from the east direction and move in
the clockwise direction. The first action is a0, the second action is a1, and so on. For all actions, rewards are shown
in Fig. 1b.

On these two settings, we implemented both Q-learning and our actor-critic algorithm (henceforth
referred to as AC-OPT). For both algorithms, we set the discount factor $\beta = 0.8$. The initial randomized
policy was set to the uniform distribution. For AC-OPT, the policy was perturbed every $r = 10$ iterations
(see Remark 3). All the results presented are averaged over 50 independent runs of the respective algorithm.

¹ MPA stands for "most probable action".

Figure 3: Performance comparison on a 44-node network graph. Panel (a): value function, plotted against node number.

Results. The tables in Figs. 2a-2b present the results obtained upon convergence of the AC-OPT and
Q-learning algorithms for the six node network graph setting, respectively. It is evident that both algorithms
converge to the optimal policy. While Q-learning recommends the best action using Q-values, AC-OPT,
being randomized, suggests the optimal action with high probability.

Fig. 3a presents the value function estimates obtained from both algorithms on the 44 node network
graph, while Fig. 3b compares the actions suggested by both algorithms upon convergence, for each
state (= node) in the network graph. It is evident that AC-OPT recommends the same (as well as optimal)
actions as Q-learning on almost all the states. Even though the recommended actions differ on
a small number of states, the difference in value estimates there is negligible.

Function approximation. We show here the results of the function approximation variant of our actor-critic
algorithm (henceforth referred to as AC-OPT-FA) and the RPAFA-2 algorithm from [1]. For any state $s$, let
$a = \lfloor s/9 \rfloor$ and $b = s \bmod 9$. Then, the state features are chosen as: $f(s) = (4 - a, 8 - b, 4 + a - b, 1)^T$.
Along similar lines, the state-action feature $\phi(s,a) = (4 - a, 8 - b, 4 + a - b, r(x,y), 1)^T$.
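The state features can be computed directly from the node index. The sketch below assumes the reading $a = \lfloor s/9 \rfloor$, $b = s \bmod 9$ and a third component $4 + a - b$; the function name is hypothetical, and the state-action feature is omitted because the term $r(x,y)$ is not defined in the text.

```python
def state_features(s):
    """f(s) = (4 - a, 8 - b, 4 + a - b, 1) with a = floor(s / 9), b = s mod 9."""
    a, b = divmod(s, 9)
    return (4 - a, 8 - b, 4 + a - b, 1)


for s in (0, 13, 44):
    print(s, state_features(s))
```

Under this reading, the destination node 44 has all non-constant features equal to zero, so the first two components behave like offsets of the node from the destination.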

Fig. 3c compares the actions recommended by the AC-OPT-FA and RPAFA-2 algorithms, while also
highlighting the sub-optimal actions. It is evident that AC-OPT-FA recommends with high probability ($\approx 0.9$
on the average) the best action with a 93% accuracy. On the other hand, RPAFA-2 achieved only a 50%
accuracy, i.e., sub-optimal actions were suggested over half of the state space.

8 Conclusions 

In this paper, we proposed a new actor-critic algorithm with guaranteed convergence to the optimal
policy in a discounted MDP. The proposed algorithm was validated through simulations on a simple shortest
path problem in networks. A topic of future study is to strengthen the convergence result of the function
approximation variant of our actor-critic algorithm.


Appendix 


A Proofs for the actor-critic algorithm 

Lemma 8. Let $R_{\pi} = (r(s,\pi), s \in S)^T$ be a column vector of rewards and $P_{\pi} = [p(y|s,\pi), s \in S, y \in S]$
be the transition probability matrix, both for a given $\pi$. Then, the system of ODEs (6) has a unique globally
asymptotically stable equilibrium given by

$$v^{\pi} = [I - \beta P_{\pi}]^{-1} R_{\pi}. \tag{13}$$


Proof. The system of ODEs (6) can be re-written in vector form as given below.

$$\frac{dv}{dt} = R_{\pi} + \beta P_{\pi} v - v.$$

Rearranging terms, we get

$$\frac{dv}{dt} = R_{\pi} + (\beta P_{\pi} - I) v, \tag{14}$$

where $I$ is the identity matrix of suitable dimension. Note that for a fixed $\pi$, this ODE is linear in $v$ and
moreover, all the eigenvalues of $(\beta P_{\pi} - I)$ have negative real parts. Thus by standard linear systems theory,
the above ODE has a unique globally asymptotically stable equilibrium which can be computed by setting
$\frac{dv}{dt} = 0$, that is, $R_{\pi} + (\beta P_{\pi} - I) v = 0$. The trajectories of the ODE (14) converge to the above equilibrium
starting from any initial condition in view of the above. □
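The eigenvalue claim used in the proof can be justified in three lines. The following is a short sketch of one standard argument (it is not taken from the paper); it uses only that $P_{\pi}$ is row-stochastic and that $\beta \in (0,1)$.

```latex
% Sketch: every eigenvalue \lambda of (\beta P_\pi - I) satisfies Re(\lambda) < 0.
\begin{align*}
(\beta P_\pi - I)x = \lambda x,\ x \neq 0
  &\;\Longrightarrow\; P_\pi x = \mu x \quad\text{with}\quad \mu = \frac{1+\lambda}{\beta},\\
P_\pi \text{ row-stochastic}
  &\;\Longrightarrow\; |\mu| \le \|P_\pi\|_\infty = 1,\\
\operatorname{Re}(\lambda) = \beta\operatorname{Re}(\mu) - 1
  &\;\le\; \beta|\mu| - 1 \;\le\; \beta - 1 \;<\; 0 .
\end{align*}
```

In particular, zero is not an eigenvalue, so $I - \beta P_{\pi}$ is invertible and the equilibrium in (13) is well defined.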

Proof of Theorem 3

For establishing the proof, we require the notion of a $(T, \delta)$-perturbation of an ODE, defined as follows:

Definition 1. Consider the ODE

$$\dot x(t) = f(x(t)). \tag{15}$$

Given $T, \delta > 0$, we say that $\bar x(\cdot)$ is a $(T, \delta)$-perturbation of (15) if there exist $0 = T_0 < T_1 < T_2 < \cdots <
T_n \uparrow \infty$ such that $T_{n+1} - T_n \ge T$, for all $n \ge 0$, and $\sup_{t \in [T_n, T_{n+1}]} \| x^n(t) - \bar x(t) \| < \delta$, for all $n \ge 0$,
where $x^n(\cdot)$ denotes a solution of (15) on $[T_n, T_{n+1}]$.

Let $Z$ be the globally asymptotically stable attractor set for (15) and $Z^{\epsilon}$ be the $\epsilon$-neighborhood of $Z$.
Then, the following lemma by Hirsch (see Theorem 1 on pp. 339 of [?]) is useful in establishing the
convergence of a $(T, \delta)$-perturbation to the limit set $Z^{\epsilon}$.

Lemma 9 (Hirsch Lemma). Given $\epsilon, T > 0$, $\exists \bar\delta > 0$ such that for all $\delta \in (0, \bar\delta)$, every $(T, \delta)$-perturbation
of (15) converges to $Z^{\epsilon}$.

Proof (Theorem 3). Fix a state $s \in S$. Let $\{\bar n\}$ represent a sub-sequence of iterations in algorithm (3) when
the state is $s \in S$. Also, let $Q_n = \{\bar n : \bar n < n\}$. For a given $\pi$, the updates of $v$ on the faster time-scale
$\{c(n)\}$ given in algorithm (3) can be re-written as

$$v_{\bar n+1}(s) = v_{\bar n}(s) + c(\bar n) \Big[ \sum_{a \in A(s)} \pi_{\bar n}(s,a)\, g^{\pi_{\bar n}}(s,a) + \chi_{\bar n} \Big], \tag{16}$$

where $\chi_{\bar n} = r(s,a) + \beta v_{\bar n}(s') - v_{\bar n}(s) - \sum_{a \in A(s)} \pi_{\bar n}(s,a)\, g^{\pi_{\bar n}}(s,a)$ is the noise term. Let $M_n = \sum_{m \in Q_n} c(m) \chi_m$.
Then, $M_n, n \ge 0$, is a convergent martingale sequence by the martingale convergence theorem (since
$\sum_n c^2(n) < \infty$ and $\|g\|_{\infty} = \sup_{s,a} |g(s,a)| < \infty$). The equation (16) can now be seen to be a $(T, \delta)$-perturbation
of the system of ODEs (6). Thus, by Lemma 9, it can be seen that $v_n$ converges to the globally asymptotically
stable equilibrium $v^{\pi}$ (see equation (13)) of the system of ODEs (6). □


Proof of Lemma 4

Proof.

If part: If $\pi \in L$ and $\sqrt{\pi(s,a)}\,g^{\pi}(s,a) = 0, \forall a \in A, s \in S$ holds, then by definition of the operators $\Gamma$ and $\bar\Gamma$,
the result follows.

Only if part: The operator $\bar\Gamma$, by definition, ensures that $\pi \in L$. Suppose for some $a \in A(s), s \in S$,
we have $\bar\Gamma\big(\sqrt{\pi(s,a)}\,g^{\pi}(s,a)\big) = 0$ but $\sqrt{\pi(s,a)}\,g^{\pi}(s,a) \neq 0$. Then, $g^{\pi}(s,a) \neq 0$ and since $\pi \in L$,
$1 \ge \pi(s,a) > 0$. We analyze this by considering the following two cases:

(i) $1 > \pi(s,a) > 0$ and $g^{\pi}(s,a) \neq 0$: In this case, it is possible to find a $\Delta > 0$ such that for all
$\delta \le \Delta$,

$$1 > \pi(s,a) + \delta \sqrt{\pi(s,a)}\,g^{\pi}(s,a) > 0.$$

This implies that

$$\bar\Gamma\big(\sqrt{\pi(s,a)}\,g^{\pi}(s,a)\big) = \sqrt{\pi(s,a)}\,g^{\pi}(s,a) \neq 0,$$

which contradicts the initial supposition.

(ii) $\pi(s,a) = 1$ and $g^{\pi}(s,a) \neq 0$: Since $v^{\pi}$ is the solution to the system of ODEs (6), the following
should hold:

$$\sum_{a' \in A(s)} \pi(s,a')\,g^{\pi}(s,a') = \pi(s,a)\,g^{\pi}(s,a) = 0.$$

This again leads to a contradiction.

The result follows. □


B Proofs for the function approximation variant 

Proof of Theorem 7

Proof. Due to timescale separation, we can assume that the policy parameter $\theta$ is constant for the sake of
analysis of the critic recursion in (10). For any fixed policy given as parameter $\theta$, the critic recursion in (10)
converges to $w^{\theta}$, which is the TD fixed point (see the statement of Theorem 7 for the explicit form of $w^{\theta}$). This is
a standard claim for TD(0) with function approximation; see [12] for a detailed proof.

Let $\mathcal{F}_n = \sigma(\theta_m, m \le n)$. The actor recursion (11) in the main paper can be re-written as

$$\begin{aligned}
\theta_{n+1} = \Gamma\Big( \theta_n &+ b(n)\, E\big[\pi_n(s_n,a_n)^{1/2}\,\psi_n(s_n,a_n)\,g(s_n,a_n) \mid \mathcal{F}_n\big] \\
&+ b(n) \Big( \pi_n(s_n,a_n)^{1/2}\,\psi_n(s_n,a_n)\,g_n(s_n,a_n) - E\big[\pi_n(s_n,a_n)^{1/2}\,\psi_n(s_n,a_n)\,g_n(s_n,a_n) \mid \mathcal{F}_n\big] \Big) \\
&+ b(n)\, E\big[\pi_n(s_n,a_n)^{1/2}\,\psi_n(s_n,a_n)\,\big(g_n(s_n,a_n) - g(s_n,a_n)\big) \mid \mathcal{F}_n\big] \Big),
\end{aligned} \tag{17}$$

where $g(s,a) := r(s,a) + \beta \sum_{s' \in S} p(s'|s,a)\, w^{\theta_n T} f(s') - w^{\theta_n T} f(s)$.

Since the critic converges, i.e., $w_n \to w^{\theta}$ a.s. as $n \to \infty$, the last term in (17) vanishes asymptotically.
Let $M_n = \sum_{m=0}^{n-1} b(m) \Big( \pi_m(s_m,a_m)^{1/2}\,\psi_m(s_m,a_m)\,g_m(s_m,a_m) - E\big[\pi_m(s_m,a_m)^{1/2}\,\psi_m(s_m,a_m)\,g_m(s_m,a_m) \mid \mathcal{F}_m\big] \Big)$.
Using arguments similar to the proof of Theorem 2 in [4], it can be seen that $M_n$ is a convergent
martingale sequence, so that the associated noise term vanishes asymptotically. So, that leaves the first term multiplying $b(n)$ in (17). A
simple calculation shows that

$$E\big[\pi_n(s_n,a_n)^{1/2}\,\psi_n(s_n,a_n)\,g(s_n,a_n) \mid \mathcal{F}_n\big]
= \sum_{s \in S} d^{\pi^{\theta_n}}(s) \sum_{a \in A} \sqrt{\pi^{\theta_n}(s,a)}\,\nabla \pi^{\theta_n}(s,a) \Big( r(s,a) + \beta \sum_{s' \in S} p(s'|s,a)\, w^{\theta_n T} f(s') - w^{\theta_n T} f(s) \Big).$$

The rest of the proof amounts to showing that the RHS above is Lipschitz continuous and that the recursion
(17) is a $(T, \delta)$-perturbation of the ODE (12) in the main paper. These facts can be verified in a similar
manner as in the proof of Theorem 2 in [4] and the final claim follows from the Hirsch lemma (see Lemma 9
above). □










C Simulation Experiments 


Results for full state representation based algorithms on 44 node graph 

Tables 1-2 present detailed results for our AC-OPT algorithm and Q-learning, respectively, on the 44-node
network graph setting. For the Q-learning results in Table 2, the action achieving the maximum in $\max_a Q(s,a)$
is shown in bold. It is evident that AC-OPT suggests the same (as well as optimal) actions as Q-learning
on almost all the states.


Node no.  Value function  MPA: Probability  Node no.  Value function  MPA: Probability
0         -40.824         0 : 0.974759      22        -27.6105        0 : 0.952729
1         -39.7619        0 : 0.940369      23        -23.6213        1 : 0.965307
2         -38.3387        0 : 0.954584      24        -19.3607        1 : 0.956485
3         -37.1019        0 : 0.934279      25        -25.1828        1 : 0.917481
4         -35.8406        1 : 0.977405      26        -19.9879        1 : 0.973978
5         -37.5327        4 : 0.775096      27        -32.8828        0 : 0.962421
6         -35.618         3 : 0.726475      28        -30.5635        0 : 0.963262
7         -36.8312        0 : 0.699411      29        -28.1035        0 : 0.935406
8         -35.2874        3 : 0.986148      30        -25.5654        0 : 0.951051
9         -38.3211        0 : 0.966336      31        -22.8029        0 : 0.965918
10        -37.9592        0 : 0.937302      32        -18.8625        0 : 0.955858
11        -36.0614        0 : 0.959576      33        -14.5632        1 : 0.929352
12        -33.4332        0 : 0.95668       34        -10.0406        1 : 0.9742
13        -31.1697        0 : 0.961255      35        -16.8062        0 : 0.928148
14        -28.057         1 : 0.95864       36        -29.7862        0 : 0.989813
15        -30.1452        0 : 0.951196      37        -27.6444        0 : 0.966042
16        -28.4007        3 : 0.940799      38        -25.6189        0 : 0.94836
17        -30.4659        1 : 0.863991      39        -23.6847        0 : 0.972548
18        -38.2062        1 : 0.937154      40        -19.5683        0 : 0.99494
19        -35.7315        1 : 0.94369       41        -14.0438        0 : 0.981092
20        -33.0474        1 : 0.930422      42        -9.6131         0 : 0.994136
21        -30.2144        0 : 0.941161      43        -5.00005        0 : 0.939764

Table 1: Performance of the AC-OPT algorithm (MPA stands for "most probable action") on the 44-node
network graph


Results for function approximation based algorithms 

Tables 3-4 present the detailed results for the function approximation based algorithms: RPAFA-2 from [1]
and our AC-OPT-FA. States that are shown in bold in these tables correspond to those where the respective
algorithm recommended a sub-optimal action. It is evident that AC-OPT-FA results in 93% accuracy, i.e.,
on 93% of the state space, AC-OPT-FA recommended the optimal action with high probability (around 0.9
in almost all states). On the other hand, RPAFA-2 achieved only 50% accuracy.











Node no.(s)  Q(s,0)    Q(s,1)    Q(s,2)    Q(s,3)    Q(s,4)    Q(s,5)    Q(s,6)    Q(s,7)
0            -39.7583  -41.4778  -47.83    N.A       N.A       N.A       N.A       N.A
1            -38.6203  -39.9753  -46.4778  -42.83    -40.7824  N.A       N.A       N.A
2            -37.3559  -38.3059  -44.9753  -41.4778  -39.7583  N.A       N.A       N.A
3            -35.951   -36.451   -43.3059  -39.9753  -38.6203  N.A       N.A       N.A
4            -37.3559  -34.39    -41.451   -38.3059  -37.3559  N.A       N.A       N.A
5            -35.951   -36.451   -39.39    -36.451   -35.951   N.A       N.A       N.A
6            -37.3559  -34.39    -41.451   -34.39    -37.3559  N.A       N.A       N.A
7            -35.951   -36.451   -39.39    -36.451   -35.951   N.A       N.A       N.A
8            -41.451   -34.39    -37.3559  N.A       N.A       N.A       N.A       N.A
9            -36.4778  -38.5253  -45.1728  -50.7824  -44.7583  N.A       N.A       N.A
10           -34.9753  -36.6948  -43.5253  -40.1728  -37.83    -45.7824  -49.7583  -43.6203
11           -33.3059  -34.6609  -41.6948  -38.5253  -36.4778  -44.7583  -48.6203  -42.3559
12           -31.451   -32.401   -39.6609  -36.6948  -34.9753  -43.6203  -47.3559  -40.951
13           -29.39    -29.89    -37.401   -34.6609  -33.3059  -42.3559  -45.951   -42.3559
14           -31.451   -27.1     -34.89    -32.401   -31.451   -40.951   -47.3559  -40.951
15           -29.39    -29.89    -32.1     -29.89    -29.39    -42.3559  -45.951   -42.3559
16           -31.451   -27.1     -34.89    -27.1     -31.451   -40.951   -47.3559  -40.951
17           -32.1     -29.89    -29.39    -42.3559  -45.951   N.A       N.A       N.A
18           -33.5253  -35.8681  -42.7813  -47.83    -41.4778  N.A       N.A       N.A
19           -31.6948  -33.7424  -40.8681  -37.7813  -35.1728  -42.83    -46.4778  -39.9753
20           -29.6609  -31.3804  -38.7424  -35.8681  -33.5253  -41.4778  -44.9753  -38.3059
21           -27.401   -28.756   -36.3804  -33.7424  -31.6948  -39.9753  -43.3059  -36.451
22           -24.89    -25.84    -33.756   -31.3804  -29.6609  -38.3059  -41.451   -34.39
23           -22.1     -22.6     -30.84    -28.756   -27.401   -36.451   -39.39    -36.451
24           -24.89    -19       -27.6     -25.84    -24.89    -34.39    -41.451   -34.39
25           -22.1     -22.6     -24       -22.6     -22.1     -36.451   -39.39    -36.451
26           -27.6     -19       -24.89    -34.39    -41.451   N.A       N.A       N.A
27           -30.8681  -33.4766  -40.629   -45.1728  -38.5253  N.A       N.A       N.A
28           -28.7424  -31.0852  -38.4766  -35.629   -32.7813  -40.1728  -43.5253  -36.6948
29           -26.3804  -28.4279  -36.0852  -33.4766  -30.8681  -38.5253  -41.6948  -34.6609
30           -23.756   -25.4755  -33.4279  -31.0852  -28.7424  -36.6948  -39.6609  -32.401
31           -20.84    -22.195   -30.4755  -28.4279  -26.3804  -34.6609  -37.401   -29.89
32           -17.6     -18.55    -27.195   -25.4755  -23.756   -32.401   -34.89    -27.1
33           -14       -14.5     -23.55    -22.195   -20.84    -29.89    -32.1     -29.89
34           -17.6     -10       -19.5     -18.55    -17.6     -27.1     -34.89    -27.1
35           -15       -14.5     -14       -29.89    -32.1     N.A       N.A       N.A
36           -28.4766  -42.7813  -35.8681  N.A       N.A       N.A       N.A       N.A
37           -26.0852  -30.629   -37.7813  -40.8681  -33.7424  N.A       N.A       N.A
38           -23.4279  -28.4766  -35.8681  -38.7424  -31.3804  N.A       N.A       N.A
39           -20.4755  -26.0852  -33.7424  -36.3804  -28.756   N.A       N.A       N.A
40           -17.195   -23.4279  -31.3804  -33.756   -25.84    N.A       N.A       N.A
41           -13.55    -20.4755  -28.756   -30.84    -22.6     N.A       N.A       N.A
42           -9.5      -17.195   -25.84    -27.6     -19       N.A       N.A       N.A
43           -5        -13.55    -22.6     -24       -22.6     N.A       N.A       N.A

Table 2: Performance of Q-learning algorithm on the 44-node network graph
Node  Value function  MPA: Probability  Node  Value function  MPA: Probability
0     -52.8351        1 : 0.975949      22    -28.674         1 : 0.96989
1     -50.4398        1 : 0.969893      23    -26.2787        1 : 0.969891
2     -48.0445        1 : 0.969893      24    -23.8834        1 : 0.96989
3     -45.6493        1 : 0.969893      25    -21.4882        1 : 0.96989
4     -43.254         1 : 0.969893      26    -19.0929        0 : 0.513957
5     -40.8587        1 : 0.969893      27    -30.965         1 : 0.975946
6     -38.4635        1 : 0.969893      28    -28.5698        1 : 0.96989
7     -36.0682        1 : 0.969893      29    -26.1745        1 : 0.96989
8     -33.6729        0 : 0.513958      30    -23.7792        1 : 0.969891
9     -45.545         1 : 0.975946      31    -21.384         1 : 0.96989
10    -43.1498        1 : 0.96989       32    -18.9887        1 : 0.96989
11    -40.7545        1 : 0.96989       33    -16.5934        1 : 0.96989
12    -38.3592        1 : 0.96989       34    -14.1982        1 : 0.969891
13    -35.964         1 : 0.969891      35    -11.8029        0 : 0.513957
14    -33.5687        1 : 0.96989       36    -23.675         0 : 0.999869
15    -31.1734        1 : 0.96989       37    -21.2797        0 : 0.993623
16    -28.7782        1 : 0.96989       38    -18.8845        0 : 0.993624
17    -26.3829        0 : 0.513957      39    -16.4892        0 : 0.993624
18    -38.255         1 : 0.975946      40    -14.0939        0 : 0.993623
19    -35.8598        1 : 0.96989       41    -11.6987        0 : 0.993623
20    -33.4645        1 : 0.969891      42    -9.30341        0 : 0.993624
21    -31.0692        1 : 0.96989       43    -6.90814        0 : 0.993624

Table 3: Performance of the function approximation variant AC-OPT-FA on the 44-node network graph


Node  MPA: Probability  Node  MPA: Probability
0     1 : 0.504191      22    0 : 0.984263
1     2 : 0.330269      23    2 : 0.497062
2     1 : 0.496113      24    1 : 0.49855
3     0 : 0.330723      25    4 : 0.996063
4     3 : 0.331711      26    1 : 0.499916
5     3 : 0.50029       27    0 : 0.329259
6     2 : 0.332378      28    2 : 0.249082
7     2 : 0.498791      29    6 : 0.255686
8     2 : 0.499996      30    2 : 0.25075
9     3 : 0.330108      31    3 : 0.500413
10    1 : 0.201589      32    2 : 0.249539
11    3 : 0.491524      33    1 : 0.20215
12    2 : 0.249318      34    1 : 0.249613
13    6 : 0.253784      35    0 : 0.999038
14    1 : 0.249081      36    0 : 0.969508
15    1 : 0.249349      37    0 : 0.978052
16    3 : 0.249717      38    0 : 0.330178
17    3 : 0.33322       39    1 : 0.336035
18    3 : 0.330103      40    0 : 0.996688
19    0 : 0.20268       41    0 : 0.989921
20    0 : 0.202288      42    3 : 0.498579
21    7 : 0.33527       43    1 : 0.49913

Table 4: Performance of RPAFA-2 algorithm from [1] on the 44-node network graph